Abstract

In this paper we provide insight into the empirical properties of indirect cross-validation
(ICV), a new method of bandwidth selection for kernel density estimators. First, we describe
the method and report on the theoretical results used to develop a practical-purpose model
for certain ICV parameters. Next, we provide a detailed description of a numerical study
which shows that the ICV method usually outperforms least squares cross-validation (LSCV)
in finite samples. One of the major advantages of ICV is its increased stability compared to
LSCV. Two real data examples show the benefit of using both ICV and a local version of ICV.



KEY WORDS: Cross-validation; Bandwidth selection; Kernel density estimation; Integrated
squared error; Mean integrated squared error.

1 Introduction 

Let $X_1, \ldots, X_n$ be a random sample from an unknown density $f$. A kernel density
estimator of $f$ at the point $x$ is defined as

$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right), \qquad (1)$$

where $h > 0$ is the bandwidth, and $K$ is the kernel, which is generally chosen to be a
unimodal probability density function that is symmetric about zero and has finite variance.
A popular choice for $K$ is the Gaussian kernel: $\phi(u) = (2\pi)^{-1/2} \exp(-u^2/2)$. To
distinguish between estimators with different kernels, we shall refer to estimator (1) with
given kernel $K$ as a $K$-kernel estimator.
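As a minimal sketch of estimator (1), the following Python code evaluates a kernel density estimate at a single point using only the standard library. The function names `gaussian_kernel` and `kde` and the sample values are illustrative assumptions, not part of the original study.

```python
import math


def gaussian_kernel(u):
    """Standard normal density phi(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate (1 / (n h)) * sum_i K((x - X_i) / h) at the point x."""
    if h <= 0:
        raise ValueError("the bandwidth h must be positive")
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)


# Made-up sample, used only to show the call signature.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
estimate_at_zero = kde(0.0, sample, h=0.5)
```

Passing a different function as `kernel` gives the corresponding $K$-kernel estimator with the same code.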
Practical implementation of the estimator (1) requires specification of the smoothing
parameter $h$. The two most widely used bandwidth selection methods are least squares
cross-validation, proposed independently by Rudemo (1982) and Bowman (1984), and the
Sheather and Jones plug-in method. Plug-in is often preferred since it produces more stable
bandwidths than does LSCV. Nevertheless, the LSCV method is still popular since it requires
fewer assumptions than the plug-in method and works well when the density is difficult to
estimate; see Loader (1999), van Es (1992), and Sain, Baggerly, and Scott (1994).

The main flaw of LSCV is high variability of the selected bandwidths. Other drawbacks
include the tendency of cross-validation curves to exhibit multiple local minima with the first
local minimum being too small (see Hall and Marron (1991)), and the tendency of LSCV
to select bandwidths that are much too small when the data exhibit a small amount of
autocorrelation (see Hart and Vieu (1990) and Cao and Vilar Fernandez (1993) for results
of a numerical study). Many modifications of LSCV have been proposed in an attempt to
improve its performance. These include the biased cross-validation of Scott and Terrell (1987),
a method of Chiu (1991), the trimmed cross-validation of Feluch and Koronacki (1992), the
modified cross-validation of Stute (1992), and the method of Ahmad and Ran (2004) based
on kernel contrasts.

This paper is concerned with a new modification of the LSCV method, called indirect
cross-validation (ICV), recently proposed by the authors in Savchuk, Hart, and Sheather (2008).
The ICV method depends on two parameters, $\alpha$ and $\sigma$. A main theoretical result is that at
asymptotically optimal choices of $\alpha$ and $\sigma$ the relative error of the ICV bandwidth can
converge to zero at a rate of $n^{-1/4}$, which is substantially better than the $n^{-1/10}$ rate of
LSCV. The present paper contains the results of an empirical study of ICV. In Section 2 we
provide a description of the method. Section 3 contains the details underlying the development
of a practical-purpose model for $\alpha$ and $\sigma$. Section 4 outlines the results of a numerical study
which, in particular, show that ICV has greater stability in finite samples than does LSCV.
In Section 5 we apply ICV and a local version of ICV to real data sets. Section 6 provides a
summary of our results.
2 Description of indirect cross-validation



2.1 Notation and definitions 

We begin with some notation and definitions that will be used subsequently. For an arbitrary
function $g$, define

$$R(g) = \int g(u)^2\,du, \qquad \mu_{jg} = \int u^j g(u)\,du,$$

where here and subsequently integrals are assumed to be over the whole real line. The
popular measures of performance of the kernel estimator (1) are integrated squared error
(ISE) and mean integrated squared error (MISE). The ISE is defined as

$$ISE(h) = \int \left(\hat f_h(x) - f(x)\right)^2 dx, \qquad (2)$$

and MISE is defined as the expectation of ISE. Assuming that the underlying density $f$
has a second derivative which is continuous and square integrable and that $R(K) < \infty$, the
bandwidth which asymptotically minimizes the MISE of the $K$-kernel estimator (1) has the
following form:

$$h_n = \left\{\frac{R(K)}{\mu_{2K}^2 R(f'')}\right\}^{1/5} n^{-1/5}. \qquad (3)$$

The LSCV criterion is given by

$$LSCV(h) = R(\hat f_h) - \frac{2}{n} \sum_{i=1}^{n} \hat f_{h,-i}(X_i), \qquad (4)$$

where $\hat f_{h,-i}$ denotes the kernel estimator (1) constructed from the data without the
observation $X_i$. A well known fact is that $LSCV(h)$ is an unbiased estimator of
$MISE(h) - \int f^2(x)\,dx$. For this reason the LSCV method is often called unbiased
cross-validation. Let $\hat h_{UCV}$ and $h_0$ denote the bandwidths which minimize the LSCV function (4)
and the MISE of the $\phi$-kernel estimator, respectively. Section 2.2 defines the ICV bandwidth,
denoted as $\hat h_{ICV}$.
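For the Gaussian kernel both terms of criterion (4) reduce to sums of normal densities, since $R(\hat f_h) = n^{-2} \sum_i \sum_j \phi_{\sqrt{2}h}(X_i - X_j)$, where $\phi_s$ denotes the $N(0, s^2)$ density. The sketch below evaluates (4) this way and picks the minimizer over a user-supplied grid. The names `normal_pdf`, `lscv_gaussian` and `lscv_bandwidth` are hypothetical, and the grid search is only one simple way to locate a minimizer; it is not claimed to be the procedure used in the study.

```python
import math


def normal_pdf(u, s):
    """Density of N(0, s^2) evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def lscv_gaussian(h, data):
    """LSCV(h) = R(f_h) - (2/n) * sum_i f_{h,-i}(X_i) for a Gaussian kernel."""
    n = len(data)
    # R(f_h): the convolution of two N(0, h^2) densities is N(0, 2 h^2).
    s2 = math.sqrt(2.0) * h
    roughness = sum(normal_pdf(xi - xj, s2) for xi in data for xj in data) / n ** 2
    # Leave-one-out estimates f_{h,-i}(X_i), each built from the other n - 1 points.
    loo_sum = 0.0
    for i, xi in enumerate(data):
        others = (normal_pdf(xi - xj, h) for j, xj in enumerate(data) if j != i)
        loo_sum += sum(others) / (n - 1)
    return roughness - 2.0 * loo_sum / n


def lscv_bandwidth(data, grid):
    """Grid value with the smallest LSCV score (a simple global search)."""
    return min(grid, key=lambda h: lscv_gaussian(h, data))
```

A call such as `lscv_bandwidth(data, [0.05 * k for k in range(1, 41)])` returns the best bandwidth on that grid; a finer grid or a numerical optimizer could replace it.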



2.2 The basic method 

The essence of the ICV method is to use different kernels at the cross-validation and density
estimation stages. The same idea is exploited by the one-sided cross-validation method
of Hart and Yi (1998) in the regression context. ICV first selects the bandwidth of an
$L$-kernel estimator using least squares cross-validation. Selection kernels $L$ used for this
purpose are described in Section 2.3. The bandwidth so obtained is rescaled so that it can
be used with the $\phi$-kernel estimator. The multiplicative constant $C$ has the following form:

$$C = \left(\frac{R(\phi)\,\mu_{2L}^2}{R(L)\,\mu_{2\phi}^2}\right)^{1/5}, \qquad (5)$$

which is motivated by the asymptotically optimal MISE bandwidth (3).



2.3 Selection kernels 

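We consider the family of kernels $\mathcal{L} = \{L(\cdot\,; \alpha, \sigma) : \alpha \ge 0, \sigma > 0\}$, where, for all $u$,

$$L(u; \alpha, \sigma) = (1 + \alpha)\phi(u) - \frac{\alpha}{\sigma}\phi\left(\frac{u}{\sigma}\right). \qquad (6)$$

Note that the Gaussian kernel is a special case of (6) when $\alpha = 0$ or $\sigma = 1$. Each member of
$\mathcal{L}$ is symmetric about 0 and has the second moment $\mu_{2L} = \int u^2 L(u)\,du = 1 + \alpha - \alpha\sigma^2$.
It follows that kernels in $\mathcal{L}$ are second order, with the exception of those for which
$\sigma = \sqrt{(1 + \alpha)/\alpha}$.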
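The properties stated for the selection kernels in (6) can be checked numerically. The sketch below codes $L(u; \alpha, \sigma)$ and verifies by midpoint-rule integration that it integrates to one and has second moment $1 + \alpha - \alpha\sigma^2$. The helper names and the integration settings (range, number of steps) are assumptions made for illustration.

```python
import math


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def selection_kernel(u, alpha, sigma):
    """L(u; alpha, sigma) = (1 + alpha) * phi(u) - (alpha / sigma) * phi(u / sigma)."""
    return (1.0 + alpha) * phi(u) - (alpha / sigma) * phi(u / sigma)


def second_moment(alpha, sigma):
    """Closed form mu_2L = 1 + alpha - alpha * sigma^2."""
    return 1.0 + alpha - alpha * sigma ** 2


def midpoint_integral(func, lo=-60.0, hi=60.0, steps=12000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (k + 0.5) * width) for k in range(steps))


alpha, sigma = 6.0, 6.0  # a negative-tailed kernel (sigma > 1)
total_mass = midpoint_integral(lambda u: selection_kernel(u, alpha, sigma))
numeric_mu2 = midpoint_integral(lambda u: u * u * selection_kernel(u, alpha, sigma))
```

With these settings `total_mass` should be close to 1 and `numeric_mu2` close to `second_moment(alpha, sigma)`; the integration range would need widening for much larger $\sigma$.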

The family $\mathcal{L}$ can be partitioned into three families: $\mathcal{L}_1$, $\mathcal{L}_2$ and $\mathcal{L}_3$. The first of these is
$\mathcal{L}_1 = \{L(\cdot\,; \alpha, \sigma) : \alpha > 0, \sigma < \frac{\alpha}{1+\alpha}\}$. Each kernel in $\mathcal{L}_1$ has a negative dip centered at $x = 0$.
The kernels in $\mathcal{L}_1$ are ones that "cut-out-the-middle," some examples of which are shown in
Figure 1(a).

The second family is $\mathcal{L}_2 = \{L(\cdot\,; \alpha, \sigma) : \alpha > 0, \frac{\alpha}{1+\alpha} \le \sigma \le 1\}$. Kernels in $\mathcal{L}_2$ are densities
which can be unimodal or bimodal. Note that the Gaussian kernel is a member of this family.
The third family is $\mathcal{L}_3 = \{L(\cdot\,; \alpha, \sigma) : \alpha > 0, \sigma > 1\}$, each member of which has negative
tails. Examples are shown in Figure 1(b).

Kernels in $\mathcal{L}_1$ and $\mathcal{L}_3$ turn out to be highly efficient for cross-validation purposes but very
inefficient for estimating $f$. This explains why we do not use $L$ as both a selection and an
estimation kernel.

Selection kernels in $\mathcal{L}$ are mixtures of two normal densities, which greatly simplifies
computations. In particular, closed form expressions exist for the LSCV and ISE functions.
This fact has been utilized by Marron and Wand (1992) to derive exact MISE expressions.
Marron and Wand (1992) point out that, in addition to their computational advantages,
normal mixtures can approximate any density arbitrarily well in various senses. Mixtures of
normals are therefore an excellent model for use in simulation studies, a fact which we take
advantage of in Section 4.
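One consequence of the normal-mixture form is that $R(L)$ is available in closed form. Writing $L = (1+\alpha)\phi - \alpha\phi_\sigma$, with $\phi_\sigma$ the $N(0, \sigma^2)$ density, and using $\int \phi_a \phi_b = \{2\pi(a^2 + b^2)\}^{-1/2}$, the sketch below computes $R(L)$ and the rescaling constant of (5), taken here as $C = \{R(\phi)\mu_{2L}^2 / R(L)\}^{1/5}$ with $\mu_{2\phi} = 1$. The function names are illustrative assumptions.

```python
import math

# Roughness of the Gaussian kernel: R(phi) = 1 / (2 * sqrt(pi)).
R_PHI = 1.0 / (2.0 * math.sqrt(math.pi))


def roughness_L(alpha, sigma):
    """R(L) for L = (1 + alpha) * phi - alpha * phi_sigma (phi_sigma is N(0, sigma^2))."""
    square_phi = (1.0 + alpha) ** 2 * R_PHI
    cross = 2.0 * alpha * (1.0 + alpha) / math.sqrt(2.0 * math.pi * (1.0 + sigma ** 2))
    square_phi_sigma = alpha ** 2 / (2.0 * sigma * math.sqrt(math.pi))
    return square_phi - cross + square_phi_sigma


def rescaling_constant(alpha, sigma):
    """C = (R(phi) * mu_2L^2 / R(L))^(1/5), using mu_2phi = 1."""
    mu_2L = 1.0 + alpha - alpha * sigma ** 2
    return (R_PHI * mu_2L ** 2 / roughness_L(alpha, sigma)) ** 0.2


# The ICV bandwidth is then C times the LSCV bandwidth of the L-kernel estimator.
c_example = rescaling_constant(6.0, 6.0)
```

When $\alpha = 0$ or $\sigma = 1$ the kernel $L$ is Gaussian and the constant reduces to 1, which gives a quick sanity check.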
Figure 1: (a) Selection kernels in $\mathcal{L}_1$ which have $\sigma = 0.5$; (b) selection kernels in $\mathcal{L}_3$ with
$\alpha = 6$. The dotted curve in both graphs corresponds to the Gaussian kernel.

3 Practical issues 

In this section we address the problem of choosing the parameters, $\alpha$ and $\sigma$, of the selection
kernel in practice. We review some large sample theory for the ICV method and provide the
theoretical results used to develop the practical-purpose model for $\alpha$ and $\sigma$.



3.1 Large sample theory 



Large sample theory was developed in Savchuk, Hart, and Sheather (2008) by considering
the asymptotic mean squared error (MSE) of the ICV bandwidth. Their results may be
summarized as follows.

1. Under suitable regularity conditions the ICV bandwidth is asymptotically normally
distributed.

2. The asymptotic MSE of $\hat h_{ICV}$ has been found for two cases: $\sigma \to 0$ (cut-out-the-middle
kernels) and $\sigma \to \infty$ (negative-tailed kernels). It turns out that when the
asymptotically optimal values of $\alpha$ and $\sigma$ are used in the respective cases, the MSE
converges to zero at the same rate of $n^{-9/10}$, but the limiting ratio of optimum mean
squared errors is 0.752, with $\sigma \to \infty$ yielding the smaller error. In comparison, the
rate at which the MSE for $\hat h_{UCV}$ converges to zero is $n^{-6/10}$.

The subsequent theoretical results are provided for the case $\sigma \to \infty$.

3. The relative rate of convergence of $\hat h_{ICV}$ to $h_0$ is $n^{-1/4}$, whereas the corresponding rate
for $\hat h_{UCV}$ is $n^{-1/10}$.

4. The values of $\sigma$ which minimize the asymptotic MSE are given by expression (7), which
involves a quantity $A_\alpha$ depending on $\alpha$. [Expression (7) and the definition of $A_\alpha$ are not
legible in the source.]

5. The asymptotically optimal $\alpha$ is 2.4233. Remarkably, the optimal $\alpha$ does not depend
on $f$.

6. When the asymptotically optimal values of $\alpha$ and $\sigma$ are used, the asymptotic bias and
standard deviation of $\hat h_{ICV}$ converge to zero at the same rate of $n^{-9/20}$.
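The rates quoted in the list above fit together through a short heuristic calculation, sketched here in LaTeX. It assumes only that the MISE-optimal bandwidth is of order $n^{-1/5}$, as in (3); the exponent $r$ is a label introduced for this illustration.

```latex
\begin{align*}
h_0 &\asymp n^{-1/5}, \\
\frac{\hat h - h_0}{h_0} = O_p\!\left(n^{-r}\right)
  &\;\Longrightarrow\; \hat h - h_0 = O_p\!\left(n^{-r-1/5}\right)
  \;\Longrightarrow\; \mathrm{MSE}(\hat h) \text{ of order } n^{-2r-2/5}, \\
\text{ICV } (r = 1/4):&\quad n^{-1/2-2/5} = n^{-9/10}, \\
\text{LSCV } (r = 1/10):&\quad n^{-1/5-2/5} = n^{-6/10}.
\end{align*}
```

The same bookkeeping gives the $n^{-9/20}$ rate for the bias and standard deviation of the ICV bandwidth, since each is of the order of the square root of the MSE.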

3.2 MSE-optimal $\alpha$ and $\sigma$

Asymptotic results are not always reliable for practical purposes. In order to have an idea
of whether the negative-tailed or cut-out-the-middle kernels should really be used, and how
good choices of $\alpha$ and $\sigma$ vary with $n$ and $f$, we considered the following expression for the
asymptotic MSE of the ICV bandwidth:

[Expression (8) for $MSE(\hat h_{ICV})$ is not legible in the source.]

Expression (8) is valid for either large or small values of $\sigma$ and includes second order bias
terms.

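As our target densities we considered the following five normal mixtures defined in the
article by Marron and Wand (1992):

Gaussian density: $N(0, 1)$

Skewed unimodal density: $\frac{1}{5}N(0, 1) + \frac{1}{5}N\left(\frac{1}{2}, \left(\frac{2}{3}\right)^2\right) + \frac{3}{5}N\left(\frac{13}{12}, \left(\frac{5}{9}\right)^2\right)$

Bimodal density: $\frac{1}{2}N\left(-1, \left(\frac{2}{3}\right)^2\right) + \frac{1}{2}N\left(1, \left(\frac{2}{3}\right)^2\right)$

Separated bimodal density: $\frac{1}{2}N\left(-\frac{3}{2}, \left(\frac{1}{2}\right)^2\right) + \frac{1}{2}N\left(\frac{3}{2}, \left(\frac{1}{2}\right)^2\right)$

Skewed bimodal density: $\frac{3}{4}N(0, 1) + \frac{1}{4}N\left(\frac{3}{2}, \left(\frac{1}{3}\right)^2\right)$

These choices for $f$ represent density shapes that are common in practice.

| n | Normal $\alpha$ | Normal $\sigma$ | Skewed unimodal $\alpha$ | Skewed unimodal $\sigma$ | Bimodal $\alpha$ | Bimodal $\sigma$ | Separated bimodal $\alpha$ | Separated bimodal $\sigma$ | Skewed bimodal $\alpha$ | Skewed bimodal $\sigma$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 3.05 | 2.79 | 5.28 | 1.68 | 109.68 | 1.03 | 16.70 | 1.19 | 343.74 | 1.01 |
| 250 | 2.78 | 4.04 | 3.16 | 2.60 | 48.46 | 1.06 | 4.51 | 1.84 | 177.15 | 1.02 |
| 500 | 2.73 | 4.97 | 2.84 | 3.56 | 6.21 | 1.55 | 3.18 | 2.58 | 161.39 | 1.02 |
| 1000 | 2.69 | 5.97 | 2.75 | 4.49 | 3.73 | 2.12 | 2.84 | 3.54 | 123.78 | 1.03 |
| 5000 | 2.61 | 8.84 | 2.66 | 6.85 | 2.77 | 4.26 | 2.70 | 5.74 | 4.71 | 1.79 |
| 20000 | 2.55 | 12.40 | 2.59 | 9.58 | 2.68 | 6.22 | 2.63 | 8.08 | 2.85 | 3.46 |
| 100000 | 2.50 | 18.80 | 2.53 | 14.27 | 2.60 | 9.19 | 2.56 | 11.94 | 2.70 | 5.65 |
| 500000 | 2.47 | 29.54 | 2.49 | 21.88 | 2.54 | 13.65 | 2.50 | 18.07 | 2.62 | 8.39 |

Table 1: MSE-optimal $\alpha$ and $\sigma$.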

In Table 1 we provide the MSE-optimal choices of $\alpha$ and $\sigma$ for the target densities at eight
sample sizes ranging from $n = 100$ up to $n = 500000$. The MSE-optimal $\alpha$ and $\sigma$ vary greatly
from one density to another, especially at "small" sample sizes. However, the optimal $\alpha$
seems to converge to about 2.5 for each density as $n$ increases, which fits with the
asymptotically optimal value $\alpha = 2.4233$. The optimal $\sigma$ increases with sample size. It is
remarkable that all the MSE-optimal $\alpha$ and $\sigma$ in Table 1 correspond to kernels from $\mathcal{L}_3$, the
family of negative-tailed kernels.



3.3 Model for the ICV parameters 

We found a practical-purpose model for $\alpha$ and $\sigma$ by using polynomial regression. Our
independent variable was $\log_{10}(n)$ and our dependent variables were the MSE-optimal values
of $\log_{10}(\alpha)$ and $\log_{10}(\sigma)$ for different densities. The $\log_{10}$ transformations for $\alpha$ and $\sigma$ stabilize
variability. Using a sixth degree polynomial for $\alpha$ and a quadratic for $\sigma$, we arrived at the
following models for $\alpha$ and $\sigma$:

$$\alpha_{mod} = 10^{\,3.390 - 1.093\log_{10}(n) + 0.025\log_{10}(n)^3 - 0.00004\log_{10}(n)^6},$$
$$\sigma_{mod} = 10^{\,-0.58 + 0.386\log_{10}(n) - 0.012\log_{10}(n)^2}, \qquad (9)$$

which are appropriate for $100 \le n \le 500000$. The MSE-optimal values of $\log_{10}(\alpha)$ and $\sigma$
together with the model fits are shown in Figure 2. In Table 2 we give the model choices
$\alpha_{mod}$ and $\sigma_{mod}$ for the same sample sizes as in Table 1.

| n | 100 | 250 | 500 | 1000 | 5000 | 20000 | 100000 | 500000 |
|---|---|---|---|---|---|---|---|---|
| $\alpha_{mod}$ | 25.20 | 12.77 | 8.24 | 5.71 | 3.23 | 2.66 | 2.66 | 2.62 |
| $\sigma_{mod}$ | 1.39 | 1.89 | 2.37 | 2.95 | 4.83 | 7.21 | 11.22 | 16.98 |

Table 2: Model choices of $\alpha$ and $\sigma$.
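Model (9) is straightforward to evaluate. The following sketch codes the two fitted polynomials and can be used to reproduce the entries of Table 2 up to rounding. The function names `alpha_mod` and `sigma_mod` are illustrative, and the range check simply enforces the stated validity range $100 \le n \le 500000$.

```python
import math


def _check_range(n):
    """Model (9) is stated to be appropriate only for 100 <= n <= 500000."""
    if not 100 <= n <= 500000:
        raise ValueError("model (9) is fitted for 100 <= n <= 500000")


def alpha_mod(n):
    """Model choice of alpha: 10 raised to a sixth degree polynomial in log10(n)."""
    _check_range(n)
    t = math.log10(n)
    return 10.0 ** (3.390 - 1.093 * t + 0.025 * t ** 3 - 0.00004 * t ** 6)


def sigma_mod(n):
    """Model choice of sigma: 10 raised to a quadratic in log10(n)."""
    _check_range(n)
    t = math.log10(n)
    return 10.0 ** (-0.58 + 0.386 * t - 0.012 * t ** 2)


# Parameters for the sample sizes used in Table 2.
table_2 = {n: (alpha_mod(n), sigma_mod(n))
           for n in (100, 250, 500, 1000, 5000, 20000, 100000, 500000)}
```

Because the polynomial coefficients are reported to only a few digits, small discrepancies in the last printed digit relative to Table 2 are to be expected.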



4 Simulation study 

The primary goal of our simulation study was to compare ICV with ordinary LSCV. However,
we will also provide simulation results for the Sheather-Jones plug-in method.

We considered the four sample sizes $n = 100$, 250, 500 and 5000, and took samples from
the target densities listed in Section 3.2. For each combination of density and sample size
we did 1000 replications. In all cases the parameters $\alpha$ and $\sigma$ in the selection kernel $L$ were
chosen according to model (9).

Let $\hat h_0$ denote the minimizer of $ISE(h)$ for a Gaussian kernel estimator. For each sample,
we computed $\hat h_0$, $\hat h^*_{ICV}$, $\hat h_{UCV}$ and the Sheather-Jones plug-in bandwidth $\hat h_{SJPI}$. The
definition of $\hat h^*_{ICV}$ is as follows:

$$\hat h^*_{ICV} = \min(\hat h_{ICV}, \hat h_{OS}), \qquad (10)$$

where $\hat h_{OS}$ is the oversmoothed bandwidth of Terrell (1990). It is arguable that no
data-driven bandwidth should be larger than $\hat h_{OS}$ since this statistic estimates an upper bound
for all MISE-optimal bandwidths (under standard smoothness conditions).
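Rule (10) amounts to capping the ICV bandwidth at the oversmoothed bandwidth. The sketch below assumes the commonly quoted form of the oversmoothed bandwidth, $3\{R(K)/(35\mu_{2K}^2)\}^{1/5} s\, n^{-1/5}$, which is roughly $1.144\, s\, n^{-1/5}$ for the Gaussian kernel, with $s$ the sample standard deviation. The source does not state which form or scale estimate was used, so this choice and the function names are assumptions.

```python
import math
import statistics


def oversmoothed_bandwidth(data):
    """Assumed oversmoothed bandwidth for the Gaussian kernel (mu_2 = 1)."""
    n = len(data)
    s = statistics.stdev(data)  # sample standard deviation as the scale estimate
    r_phi = 1.0 / (2.0 * math.sqrt(math.pi))
    return 3.0 * (r_phi / 35.0) ** 0.2 * s * n ** (-0.2)


def capped_icv_bandwidth(h_icv, data):
    """Rule (10): the smaller of the ICV bandwidth and the oversmoothed bandwidth."""
    return min(h_icv, oversmoothed_bandwidth(data))
```

An ICV minimizer that lands on a spurious second local minimum far to the right is replaced by the oversmoothed value, while a smaller minimizer is returned unchanged.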

For any random variable $Y$ defined in each replication of our simulation, we denote the
mean, standard deviation and median of $Y$ over all replications (with $n$ and $f$ fixed) by
$E(Y)$, $SD(Y)$ and $Median(Y)$. To evaluate the bandwidth selectors we computed
$E\{ISE(\hat h)/ISE(\hat h_0)\}$ and $Median\{ISE(\hat h)/ISE(\hat h_0)\}$ for $\hat h$ equal to each of
$\hat h^*_{ICV}$, $\hat h_{UCV}$ and $\hat h_{SJPI}$. We also computed the performance measure
$E(\hat h - E(\hat h_0))^2$, which estimates the MSE of the bandwidth $\hat h$.

Figure 2: MSE-optimal $\log_{10}(\alpha)$ and $\sigma$ and the model fits.

Figure 3: Boxplots for the data-driven bandwidths in case of the Normal density (boxes:
LSCV, SJPI, ICV and ISE).
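The performance measures just described are simple summaries over replications. The sketch below computes them from per-replication lists; the argument names and the dictionary keys are hypothetical, and `statistics.stdev` uses the $n - 1$ divisor, which may differ from the convention used in the study.

```python
import statistics


def performance_measures(h_selected, h_opt, ise_selected, ise_opt):
    """Summaries of one bandwidth selector over simulation replications.

    h_selected   : data-driven bandwidths, one per replication
    h_opt        : ISE-optimal bandwidths for the same samples
    ise_selected : ISE evaluated at the data-driven bandwidths
    ise_opt      : ISE evaluated at the ISE-optimal bandwidths
    """
    ratios = [a / b for a, b in zip(ise_selected, ise_opt)]
    mean_h_opt = statistics.fmean(h_opt)
    squared_distances = [(h - mean_h_opt) ** 2 for h in h_selected]
    return {
        "mean_h": statistics.fmean(h_selected),
        "sd_h": statistics.stdev(h_selected),
        "mse_h": statistics.fmean(squared_distances),  # E(h - E(h_0))^2
        "mean_ise_ratio": statistics.fmean(ratios),
        "median_ise_ratio": statistics.median(ratios),
    }
```

Each key corresponds to one block of rows in the simulation tables, with `sd_h` and `mse_h` reported there after scaling by $10^2$ and $10^4$.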

Our main simulation results for the "normal" and "bimodal" densities, as defined in
Section 3.2, are given in Tables 3 and 4 and Figures 3 and 4. Results for the other densities
are available from the authors. Other statistics reported in Tables 3 and 4 are $E(\hat h)$ and
$SD(\hat h)$ for each type of bandwidth considered.

The reduced variability of the ICV bandwidth is evident in our study. The ratio
$SD(\hat h^*_{ICV})/SD(\hat h_{UCV})$ ranged between 0.2103 and 0.9713 in the twenty settings considered.
However, the variances of the ICV bandwidths were always higher compared to the
Sheather-Jones plug-in bandwidths. It is worth noting that the ratio of sample standard
deviations of the ICV and LSCV bandwidths decreases as the sample size $n$ increases.

| Measure | n | LSCV | SJPI | ICV | ISE |
|---|---|---|---|---|---|
| $E(\hat h)$ | 100 | 0.44524596 | 0.39338747 | 0.41530230 | 0.43162318 |
| $E(\hat h)$ | 250 | 0.36398008 | 0.33883538 | 0.34944737 | 0.35487029 |
| $E(\hat h)$ | 500 | 0.31094126 | 0.29803205 | 0.30864570 | 0.30806146 |
| $E(\hat h)$ | 5000 | 0.18359629 | 0.18992356 | 0.19768683 | 0.19526358 |
| $SD(\hat h) \cdot 10^2$ | 100 | 12.32173263 | 6.43244579 | 6.52298637 | 7.52008697 |
| $SD(\hat h) \cdot 10^2$ | 250 | 8.35772162 | 3.71742374 | 4.44775700 | 6.27300326 |
| $SD(\hat h) \cdot 10^2$ | 500 | 7.11168918 | 2.60300987 | 3.08015801 | 5.63495059 |
| $SD(\hat h) \cdot 10^2$ | 5000 | 3.90077096 | 0.61900268 | 0.82041632 | 3.09277421 |
| $E(\hat h - E(\hat h_0))^2 \cdot 10^4$ | 100 | 153.52907115 | 55.95467615 | 45.17051435 | |
| $E(\hat h - E(\hat h_0))^2 \cdot 10^4$ | 250 | 70.61154173 | 16.37660421 | 20.05684094 | |
| $E(\hat h - E(\hat h_0))^2 \cdot 10^4$ | 500 | 50.60847941 | 7.77477660 | 9.48129936 | |
| $E(\hat h - E(\hat h_0))^2 \cdot 10^4$ | 5000 | 16.56205491 | 0.66793916 | 0.73113122 | |
| $E(ISE(\hat h)/ISE(\hat h_0))$ | 100 | 2.46997542 | 1.90795915 | 1.72178966 | |
| $E(ISE(\hat h)/ISE(\hat h_0))$ | 250 | 1.91593730 | 1.50563016 | 1.47567596 | |
| $E(ISE(\hat h)/ISE(\hat h_0))$ | 500 | 1.75806058 | 1.37734003 | 1.36096679 | |
| $E(ISE(\hat h)/ISE(\hat h_0))$ | 5000 | 1.41316047 | 1.11460567 | 1.10313807 | |
| $Median(ISE(\hat h)/ISE(\hat h_0))$ | 100 | 1.31108630 | 1.15695876 | 1.11233574 | |
| $Median(ISE(\hat h)/ISE(\hat h_0))$ | 250 | 1.21715835 | 1.10408948 | 1.09365380 | |
| $Median(ISE(\hat h)/ISE(\hat h_0))$ | 500 | 1.21396609 | 1.10306404 | 1.09608944 | |
| $Median(ISE(\hat h)/ISE(\hat h_0))$ | 5000 | 1.10907960 | 1.04471055 | 1.05183075 | |

Table 3: Simulation results for the Gaussian density.

| Measure | n | LSCV | SJPI | ICV | ISE |
|---|---|---|---|---|---|
| $E(\hat h)$ | 100 | 0.42908686 | 0.39453431 | 0.41955286 | 0.38237337 |
| $E(\hat h)$ | 250 | 0.31360942 | 0.31160054 | 0.32846189 | 0.29715278 |
| $E(\hat h)$ | 500 | 0.25927533 | 0.26238646 | 0.27450416 | 0.25320682 |
| $E(\hat h)$ | 5000 | 0.15262210 | 0.15706804 | 0.16255246 | 0.15478049 |
| $SD(\hat h) \cdot 10^2$ | 100 | 13.56532316 | 7.44425312 | 9.56680379 | 7.60899932 |
| $SD(\hat h) \cdot 10^2$ | 250 | 8.46734473 | 4.18778288 | 6.50918853 | 4.29431763 |
| $SD(\hat h) \cdot 10^2$ | 500 | 5.70587208 | 2.44443305 | 4.20078840 | 3.55982408 |
| $SD(\hat h) \cdot 10^2$ | 5000 | 2.46293965 | 0.47951752 | 0.81457083 | 1.96503777 |
| $E(\hat h - E(\hat h_0))^2 \cdot 10^4$ | 100 | 205.65547766 | 56.84037076 | 105.25535253 | |
| $E(\hat h - E(\hat h_0))^2 \cdot 10^4$ | 250 | 74.33244070 | 19.60736507 | 52.12977298 | |
| $E(\hat h - E(\hat h_0))^2 \cdot 10^4$ | 500 | 32.89268754 | 6.81193546 | 22.16474597 | |
| $E(\hat h - E(\hat h_0))^2 \cdot 10^4$ | 5000 | 6.10659189 | 0.28203637 | 1.26689717 | |
| $E(ISE(\hat h)/ISE(\hat h_0))$ | 100 | 1.69951929 | 1.32733595 | 1.36143018 | |
| $E(ISE(\hat h)/ISE(\hat h_0))$ | 250 | 1.51599857 | 1.20914143 | 1.28743335 | |
| $E(ISE(\hat h)/ISE(\hat h_0))$ | 500 | 1.41670996 | 1.15070890 | 1.19168891 | |
| $E(ISE(\hat h)/ISE(\hat h_0))$ | 5000 | 2.06430484 | 1.06839987 | 1.07675906 | |
| $Median(ISE(\hat h)/ISE(\hat h_0))$ | 100 | 1.20951575 | 1.08744161 | 1.13356965 | |
| $Median(ISE(\hat h)/ISE(\hat h_0))$ | 250 | 1.16087896 | 1.08338970 | 1.12699702 | |
| $Median(ISE(\hat h)/ISE(\hat h_0))$ | 500 | 1.12243694 | 1.06072702 | 1.09421867 | |
| $Median(ISE(\hat h)/ISE(\hat h_0))$ | 5000 | 1.05825025 | 1.03067963 | 1.03649944 | |

Table 4: Simulation results for the Bimodal density.

Figure 4: Boxplots for the data-driven bandwidths in case of the Bimodal density (panels:
n = 100, 250, 500 and 5000; boxes: LSCV, SJPI, ICV and ISE).

Figure 5: Kernel density estimates for random bandwidths from the simulation with the
Skewed Unimodal density and n = 250.

The mean squared distance $E(\hat h - E(\hat h_0))^2$ was smaller for the ICV method than for the
LSCV method in all but two cases corresponding to the Skewed Bimodal density, $n = 250$
and 500. Plug-in had a smaller value of $E(\hat h - E(\hat h_0))^2$ than did ICV in most cases; Table 3
shows an exception for the normal density at $n = 100$.

The most important observation is that the values of $E\{ISE(\hat h)/ISE(\hat h_0)\}$ were smaller
for ICV than for LSCV for all combinations of densities and sample sizes. The values of
$Median(ISE(\hat h)/ISE(\hat h_0))$ were smaller for ICV than for LSCV in all but one case, which
corresponds to the Skewed Bimodal density at $n = 250$, when $Median(ISE(\hat h^*_{ICV})/ISE(\hat h_0))$
was 1.0013 times greater than $Median(ISE(\hat h_{UCV})/ISE(\hat h_0))$.



Despite the fact that the LSCV bandwidth is asymptotically normally distributed (see Hall
and Marron), its distribution in finite samples tends to be skewed to the left. In our
simulations we have noticed that the distribution of the ICV bandwidth is less skewed than
that of the LSCV bandwidth. A typical case is illustrated in Figure 5, where kernel density
estimates for the two data-driven bandwidths are plotted from the simulation with the Skewed
Unimodal density at $n = 250$. Also plotted is a density estimate for the ISE-optimal
bandwidths. Note that the ICV density is more concentrated near the middle of the
ISE-optimal distribution than the density estimate for LSCV.

Figure 6 provides scatterplots of the bandwidths $\hat h_{UCV}$ and $\hat h_{ICV}$ versus $\hat h_0$ in the case
of the Gaussian density and $n = 500$. The sample correlation coefficients were -0.52 and
-0.60 for LSCV and ICV, respectively. The fact that these correlations are negative is a
well-established phenomenon; see, for example, Hall and Johnstone (1992). Note that the
ICV bandwidths cluster more tightly about the MISE minimizer $h_0 = 0.315$.

Figure 6: Scatterplots of $\hat h$ vs. $\hat h_0$ for the case of a Gaussian density and $n = 500$, with $\hat h$
corresponding to the (a) LSCV and (b) ICV bandwidths.

A problem we have noticed with the ICV method is that its criterion function can have
two local minima when the sample size is moderate and the density has two modes. The
following example illustrates the problem. In Figure 7(a) we have plotted three ICV curves
for the case of the Separated Bimodal density and $n = 100$. The minimizers of the solid,
dashed and dotted lines occur at the $h$-values 0.2991, 2.0467 and 0.2204, respectively. For
comparison, the corresponding bandwidths chosen by the Sheather-Jones plug-in method are
0.3240, 0.2508 and 0.2467. The value of $h = 2.0467$ which minimizes the dashed ICV curve
is obviously too large. The other local minimum of that curve, at 0.1295, would yield a much
more reasonable estimate. The problem of choosing too large a bandwidth from the second
local minimum is mitigated by using the rule (10). Indeed, the oversmoothed bandwidths for
the three samples are shown by the vertical lines in Figure 7 and were 0.7404, 0.7580 and
0.7341. Note that the problem with the ICV curve having two local minima of approximately
the same value quickly goes away as the sample size increases. This is illustrated in
Figure 7(b), where we have plotted three criterion curves for the Separated Bimodal case with
$n = 500$. Thus, the selection rule $\hat h^*_{ICV}$ given by (10) rather than just $\hat h_{ICV}$ appears to be
useful mostly for small and moderate sample sizes.

Figure 7: Three ICV criterion functions in case of the Separated Bimodal density at (a)
n = 100 and (b) n = 500.



5 Real data examples 

In this section we show how the ICV method works on two real data sets. The purpose
of the first example is to compare the performance of the ICV, LSCV, and Sheather-Jones
plug-in methods. The second example illustrates the benefit of using ICV locally.



5.1 PGA data 

In this example the data are the average numbers of putts per round played, for the top
175 players on the 1980 and 2001 PGA golf tours. The question of interest is whether there
has been any improvement from 1980 to 2001. This data set has already been analyzed
by Sheather (2004) in the context of comparing the performances of LSCV and
Sheather-Jones plug-in.

In Figure 8 we have plotted an unsmoothed frequency histogram and the LSCV, ICV and
Sheather-Jones plug-in density estimates for a combined data set of 1980 and 2001 putting
averages. The class interval size in the unsmoothed histogram was chosen to be 0.01, which
corresponds to the accuracy to which the data have been reported. There is a clear indication
of two modes in the histogram.

The estimate based on the LSCV bandwidth is apparently undersmoothed. The ICV
and plug-in estimates look similar and have two modes, which agrees with evidence from the
unsmoothed histogram and seems reasonable since the data were taken from two populations.
Figure 8: Unsmoothed frequency histogram and kernel density estimates for average numbers
of putts per round from 1980 and 2001 combined.

In Figure 9 we have plotted kernel density estimates separately for the years 1980 and
2001. ICV seems to produce a reasonable estimate in both years, whereas LSCV yields a
very wiggly and apparently undersmoothed estimate in 2001.

5.2 Local ICV example 



Local cross-validation methods for density estimation, independently proposed by Hall and
Schucany (1989) and Mielniczuk, Sarda, and Vieu (1989), consist in performing LSCV at each
value of the argument $x$ using a fraction of the data that are close to $x$. Allowing the
bandwidth to depend on $x$ is desirable when the smoothness of the underlying density changes
sufficiently with $x$.

Figure 9: Kernel density estimates based on LSCV (dashed curve) and ICV (solid curve)
produced separately for the data from 1980 and 2001 (horizontal axis: average number of
putts).



The local ICV method was introduced in Savchuk, Hart, and Sheather (2008). It is
different from the local LSCV method in that it uses ICV rather than LSCV for the local
bandwidth selection. Another difference is that local ICV uses the first local minimizer of
the local criterion function as opposed to the global minimizer of local LSCV.

The local ICV criterion function is defined as

$$ICV(x, b, w) = \frac{1}{w} \int \phi\left(\frac{x - u}{w}\right) \hat f_b(u)^2\,du - \frac{2}{nw} \sum_{i=1}^{n} \phi\left(\frac{x - X_i}{w}\right) \hat f_{b,-i}(X_i),$$

where the function $\hat f_b$ is the kernel density estimate based on a selection kernel $L$ with
smoothing parameter $b$. The quantity $w$ defines the extent to which the cross-validation is
local, with a large choice of $w$ corresponding to global ICV. Let $\hat b(x)$ be the first local
minimizer of the local ICV curve for the fixed value of $x$. Then the corresponding bandwidth
of a $\phi$-kernel estimator is defined as $\hat h(x) = C\hat b(x)$, where $C$ is computed as in (5).
Local ICV outperformed the local LSCV method in a simulated data example in the article
of Savchuk, Hart, and Sheather (2008). In this paper we show how local ICV and LSCV
perform in a real data example.

Figure 10: Density estimates for the DC data set with (a) being the global ICV density
estimate and (b) corresponding to the local ICV estimate.
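The first-local-minimizer rule used by local ICV can be sketched on a criterion curve evaluated over an increasing grid of bandwidths. The function names below, the tie-handling at flat stretches and the fallback to the global minimizer when no interior local minimum exists are assumptions made for this illustration.

```python
def first_local_minimizer(grid, values):
    """Smallest grid point at which the curve has an interior local minimum.

    grid   : increasing bandwidth values b_1 < b_2 < ... < b_m
    values : criterion evaluated at those bandwidths
    """
    for k in range(1, len(values) - 1):
        if values[k] < values[k - 1] and values[k] <= values[k + 1]:
            return grid[k]
    # Assumed fallback: no interior local minimum, so use the global minimizer.
    best = min(range(len(values)), key=lambda k: values[k])
    return grid[best]


def local_bandwidth(grid, values, c):
    """Rescale the selected b(x) to a Gaussian-kernel bandwidth, h(x) = C * b(x)."""
    return c * first_local_minimizer(grid, values)
```

For a curve with two dips, the rule returns the left one even when the right one is deeper, which is what distinguishes it from a global search.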

We analyze the data of size $n = 517$ on the Drought Code (DC) of the Canadian Forest
Fire Weather Index (FWI) system. DC is one of the explanatory variables which can be
used to predict the burned area of a forest in the Forest Fires data set. These data can be
downloaded from the website http://archive.ics.uci.edu/ml/datasets/Forest+Fires.
The data were collected and analyzed by Cortez and Morais (2007).



We computed the LSCV, ICV and Sheather-Jones plug-in bandwidths for the DC data.
The LSCV method failed by yielding $\hat h_{UCV} = 0$. The ICV and Sheather-Jones plug-in
bandwidths were very close and produced similar density estimates. Figure 10(a) gives the ICV
density estimate. It shows two major modes connected with a wiggly curve, which indicates that
varying the bandwidth with $x$ may yield a smoother estimate of the underlying density.

Local ICV and LSCV have been applied to the DC data. We used $w = 40$ for both
methods and the selection kernel with $\alpha = 6$ and $\sigma = 6$ for local ICV. This $(\alpha, \sigma)$ choice
performed quite well for unimodal densities in our simulation studies on global ICV, and
hence seems to be reasonable for local bandwidth selection since locally the density should
have relatively few features. Let $x_{(i)}$, $i = 1, \ldots, n$, denote the $i$th member of the ordered
sequence of observations. The local ICV and LSCV bandwidths were found for 50 evenly
spaced points in the interval $x_{(1)} - 0.2(x_{(n)} - x_{(1)}) \le x \le x_{(n)} + 0.2(x_{(n)} - x_{(1)})$. It turns out
that in 45 out of 50 cases the local LSCV curve tends to $-\infty$ as $h \to 0$, which implies that
the local LSCV estimate cannot be computed. All 50 local ICV bandwidths were positive.
We found a smooth function $\hat h(x)$ by interpolating at other values of $x$ via a spline. The
corresponding local ICV estimate, given in Figure 10(b), shows a smoother density estimate.
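The evaluation grid described above is easy to construct. The sketch below builds the 50 evenly spaced points on the padded data range and then interpolates bandwidths between them. Piecewise-linear interpolation is swapped in here for the spline mentioned in the text purely to keep the example dependency-free, and all function names are hypothetical.

```python
def evaluation_grid(data, m=50, pad=0.2):
    """m evenly spaced points on [x_(1) - pad * range, x_(n) + pad * range]."""
    lo, hi = min(data), max(data)
    span = hi - lo
    left, right = lo - pad * span, hi + pad * span
    step = (right - left) / (m - 1)
    return [left + k * step for k in range(m)]


def interpolate_bandwidth(x, grid, h_values):
    """Piecewise-linear stand-in for the spline: h(x) between grid points."""
    if x <= grid[0]:
        return h_values[0]
    if x >= grid[-1]:
        return h_values[-1]
    for k in range(len(grid) - 1):
        if grid[k] <= x <= grid[k + 1]:
            t = (x - grid[k]) / (grid[k + 1] - grid[k])
            return (1.0 - t) * h_values[k] + t * h_values[k + 1]
    raise ValueError("grid must be sorted in increasing order")
```

A spline would give a smoother $\hat h(x)$ than this stand-in, at the cost of an external numerical library.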

6 Summary 

Indirect cross-validation is a method of bandwidth selection in the univariate kernel density
estimation context. The method first selects the bandwidth of an $L$-kernel estimator by least
squares cross-validation, and then rescales this bandwidth so that it is appropriate for use in a
Gaussian kernel density estimator. Selection kernels $L$ have the form
$(1 + \alpha)\phi(u) - \alpha\phi(u/\sigma)/\sigma$, where $\alpha > 0$, $\sigma > 0$ and $\phi$ is the Gaussian kernel. Optimal
kernels from this class yield bandwidths with relative error that converges to 0 at a rate of
$n^{-1/4}$, which is a substantial improvement over the $n^{-1/10}$ rate of LSCV.

A practical-purpose model for the selection kernel parameters, $\alpha$ and $\sigma$, has been
developed. The model was built by performing polynomial regression on the MSE-optimal values
of $\log_{10}(\alpha)$ and $\log_{10}(\sigma)$ at different sample sizes for five target densities. Use of this model
makes the ICV method completely automatic.

An extensive simulation study showed that in finite samples ICV is more stable than
LSCV. Although both ICV and LSCV bandwidths are asymptotically normal, the
distribution of the ICV bandwidth for finite $n$ is usually more symmetric and better concentrated in
the middle of the density for ISE-optimal bandwidths. Using an oversmoothed bandwidth
as an upper bound for the bandwidth search interval reduces the bias of the method and
prevents selecting an impractically large value of $h$ when the criterion curves exhibit multiple
local minima.

The ICV method performs well in real data examples. ICV applied locally yields density
estimates which are smoother than estimates based on a single bandwidth. Often, local
ICV estimates may be found when the local LSCV estimates do not exist.